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Abstract 

In many problems, complex non-Gaussian and/or nonlinear models are required to accurately describe 
a physical system of interest. In such cases, Monte Carlo algorithms are remarkably flexible and extremely 
powerful approaches to solve such inference problems. However, in the presence of a high-dimensional and/or 
multimodal posterior distribution, it is widely documented that standard Monte-Carlo techniques could lead to 
poor performance. In this paper, the study is focused on a Sequential Monte-Carlo (SMC) sampler framework, 
a more robust and efficient Monte Carlo algorithm. Although this approach presents many advantages over 
traditional Monte-Carlo methods, the potential of this emergent technique is however largely underexploited 
in signal processing. In this work, we aim at proposing some novel strategies that will improve the efficiency 
and facilitate practical implementation of the SMC sampler specifically for signal processing applications. 
Firstly, we propose an automatic and adaptive strategy that selects the sequence of distributions within the 
SMC sampler that minimizes the asymptotic variance of the estimator of the posterior normalization constant. 

This is critical for performing model selection in modelling applications in Bayesian signal processing. The 
second original contribution we present improves the global efficiency of the SMC sampler by introducing 
a novel correction mechanism that allows the use of the particles generated through all the iterations of the 
algorithm (instead of only particles from the last iteration). This is a significant contribution as it removes the 
need to discard a large portion of the samples obtained, as is standard in standard SMC methods. This will 
improve estimation performance in practical settings where computational budget is important to consider. 
Keywords: Bayesian inference. Sequential Monte Carlo sampler, complex models. 

I. Introduction 

Bayesian inference is an important area of signal processing, relevant to a wide range of real-world 
applications. As opposed to the point estimators (means, variances) used by classical statistics, Bayesian 
statistics is concerned with generating the posterior distribtition of the unknown parameters given both the 
data and some prior density for these parameters. As such, Bayesian statistics provides a much more complete 
picture of the uncertainty in the estimation of the unknown parameters of a model. 
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In Bayesian inference, the model parameters are regarded as random variables, and the main object of 
interest is the posterior distribution, i.e. the distribution of the parameters given the data. Specifically, the 
posterior density is defined via Bayes’ Theorem as the normalized product of the prior density and the 
likelihood 


p(o |y) 


p{y\0)p(0) 

J E p(y\ d )p( d ) de 


ccp(y\0)p{d), 


(1) 


where E denotes the parameter space of 6. From equation £Q| it is clear that p(0 |y) involves a contribution 
from the observed data through p(y\0), and a contribution from prior information quantified through p(O). 
The posterior p(0\y) contains all relevant information on the unknown parameters 6 given the observed 
data y. All statistical inference can be deduced from the posterior distribution, through appropriate choice of 
summaries. This typically takes the form of evaluating integrals such as, 


J = J (p{0)p(G\y)d6, (2) 

for some integrable function p(O) with respect to the posterior distribution. For example, point estimates for 
unknown parameters are given by the posterior means, i.e., <~p(0) = 6 ; prediction for future data y is based 
on the posterior predictive distribution p(y\0,y), both of which are easily expressed as integral functionals 
of the posterior. 

There are several potential difficulties in any practical implementation of a Bayesian method. One of them is 
the issue of specifying the prior distribution. However, extra difficulties arise in actually calculating the various 
quantities required. First, in applying Bayes’s theorem we need to compute the integral in the denominator. 
Secondly, the process of inference may require the calculation of further integrals of other operations on 
the posterior distribution. These calculations may be difficult to perform in practice, especially in complex 
problems involving a high-dimensional and/or multimodal posterior distribution. These integrals are typically 
approximated using Monte Carlo methods, requiring the ability to sample from general probability distributions 
which can typically be only evaluated up to a normalizing constant. 


A. Existing works 

In many cases, using standard sampling techniques such as inversion or rejection to sample from a target 
distribution (i.e. posterior distribution) is either not possible or can prove to be too much of a computational 
burden. This has led to the development in recent years of much more advanced algorithms which allow one 
to obtain the required samples from the target distribution. Standard approaches are mostly based on Markov 
chain Monte Carlo (MCMC), where the equilibrium distribution of the chain is the target distribution and 
its ergodic mean converges to the expected value [1], MCMC algorithms have been applied with success to 
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many problems, e.g. [2]-[5]. However, there arc two major drawbacks with MCMC methods. Firstly, it is 
difficult to assess when the Markov chain has reached its stationary regime of interest. Secondly, if the target 
distribution is highly multi-modal, MCMC algorithms can easily become trapped in local modes. 

In recent years, more robust and efficient Monte Carlo algorithms have been established in order to 
efficiently explore high dimensional and multimodal spaces. Many of them are population based, in that 
they deal explicitly with a collection of samples at each iteration, including population-based MCMC [6], 
[7] and sequential Monte-Carlo samplers [8]—[10]. In [7], the authors provide a detailed review as well as 
several illustrations showing that such population strategies can lead to significant improvement compared to 
standard MCMC techniques. 

Population-based MCMC was originally developed by Geyer [11]. Further advances came with an evo¬ 
lutionary Monte Carlo algorithm in [12] and [6], each of which attempted to produce genetic algorithm 
type moves to improve the mixing of the Markov chain. It works by simulating a population of several 
Markov chains with different invariant distributions in parallel using MCMC. The population is updated by 
mutation (Metropolis update in one single chain), crossover (partial state swapping between different chains), 
and exchange operators (full state swapping between different chains). However, like standard MCMC, this 
population-based MCMC algorithm still suffers from the difficulty of assessing when the Markov chains have 
reached their stationary regime. 

The second population-based simulation approach is the sequential Monte Carlo sampler proposed in [8], 
[9]. Sequential Monte Carlo (SMC) methods is a class of sampling algorithms which combine importance 
sampling and resampling. They have been primarily used in the “particle filter” setting to solve optimal filtering 
problems; see, for example, [13] and [14] for recent reviews. In this context, SMC methods/particle filters 
have enjoyed wide-spread use in various applications (tracking, computer vision, digital communications) due 
to the fact that they provide a simple way of approximating complex filtering distribution sequentially in time. 
But in [8]—[10], the authors developed a general framework that allows SMC to be used to simulate from 
a single and static target distribution, thus becoming a promising alternative to standard MCMC methods. 
The SMC sampler framework involves the construction of a sequence of artificial distributions on spaces of 
increasing dimensions which admit the distributions of interests as particular marginals. The mechanism is 
similar to sequential importance sampling (resampling) ( [15] and [16]), with one of the crucial differences 
being the framework under which the particles are allowed to move, resulting in differences in the calculation 
of the weights of the particles. 

These methods have several advantages over population-based MCMC methods. Firstly, unlike MCMC, 
SMC methods do not require any burn-in period and do not face the sometimes contentious issue of diagnosing 
convergence of a Markov chain. Secondly, as discussed in [7], compared to population-based MCMC, SMC 
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samplers arc a richer class of methods since there is substantially more freedom in specifying the mutation 
kernels in SMC: kernels do not need to be reversible or even Markov (and hence can be time adaptive). 
Finally, unlike MCMC, SMC samplers provide an unbiased estimate of the unknown normalizing constant 
of the posterior distribution whatever the number of particles used [17]. 

B. Contributions 

Although this approach presents many advantages over traditional MCMC methods, the potential of these 
emergent techniques is however largely underexploited in signal processing. In this paper, we therefore focus 
our study on this technique by aiming at proposing some novel strategies that will improve the efficiency and 
facilitate practical implementation of the SMC sampler. More specifically, we firstly derive some convergence 
results of the SMC sampler for some specific choice of the backward kernel, which is generally used in 
practice, as well as under a perfectly mixing forward kernel. This convergence result facilitates the analysis 
of the SMC sampler and in particular highlights the impact of the choice of the sequence of target distributions 
on the algorithm performance. The first contribution of this paper consists in proposing a adaptive strategy 
in order to obtain an automatic choice of the sequence of intermediate target distributions that optimizes the 
asymptotic variance of the estimator of the marginal likelihood. The second main contribution is the derivation 
of effective schemes in order to improve the global efficiency of the SMC sampler. The idea developed in 
this paper is to propose some correction mechanisms that allow the use of the particles generated through all 
the iterations of the algorithm (instead of only the particles from the last iteration) in order to improve the 
accuracy of the empirical approximation of the target distribution. 

II. SMC Samplers for Bayesian inference 
A. General Principle of SMC Samplers 

Sequential Monte Carlo (SMC) methods are a class of sampling algorithms which combine importance 
sampling and resampling. The SMC sampler is based on two main ideas: 

a) Rather than sampling directly the complex distribution of interest, a sequence of intermediate target 
distributions, { 77 }^, are designed, that transitions smoothly from a simpler distribution to the one of 
interest. In Bayesian inference problems, the target distribution is the posterior ttt(Q) = p(0\z), thus a 
natural choice for such a sequence of intermediate distributions is to select the following [18] 

Ttt(0) = ocp(0)p(y|0)^*, (3) 

where {ft} is a non-decreasing temperature schedule with fo = 0 and fr = 1 and 7 t {0) corresponds 
to the unnormalized target distribution (i.e. 7 1 {0) = p(0)p(y\0)^*) and Z t = p{0)p{y\0) r f’ t d0 is the 
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normalization constant. We initially target the prior distribution 7To = p(6) which is generally easy to 
sample directly from and then introduce the effect of the likelihood gradually in order to obtain at the 
end, t = T, the complex posterior distribution of interest 7 tt(0) = p{9 |y) as target distribution, 
b) The idea is to transform this problem in the standard SMC filtering framework, where the sequence of 
target distributions on the path-space, denoted by { 7 h}f =1 , which admits 7 Tt{xt) as marginals, is defined 
on the product space, i.e., supp(7f i ) = 0 x © x ... x0 = 0*. This novel sequence of joint target 
distributions 7 x t is defined as follows: 




7t(9l:t) 

z t 


(4) 


where 


t—i 


= 7 t{0t) C k (O k+1 ,0 k ), 


(5) 


k =1 


in which the artificial kernels introduced {Ci.} k=l arc called backward Markov kernels since Ct(6t+ 1 , Of) 
denotes the probability density of moving back from 6 t +\ to 0 t . By using such a sequence of extended 
target distributions { 7 h}J =1 based on the introduction of backward kernels {£&}£sequential impor¬ 
tance sampling can thus be utilized in the same manner as standard SMC filtering algorithms. 

Within this framework, one may then work with the constructed sequence of distributions, 7r t , under the 
standard SMC algorithm [16]. In summary, the SMC sampler algorithm therefore involves three stages: 

1) Mutation: , where the particles are moved from 6 t ~ 1 to 0 t via a mutation kernel ]Ct(9t—i . Qf) also called 
forward kernel ; 

2) Correction: , where the particles are reweighted with respect to 7T* via the incremental importance weight 
(Equation ©); and 

3) Selection: , where according to some measure of particle diversity, such as effective sample size, the 
weighted particles may be resampled in order to reduce the variability of the importance weights. 


(m 


In more detail, suppose that at time t — 1, we have a set of weighted particles j 6^_i,W t _ 
approximates 7f t _i via the empirical measure 


?f , 


that 


N 




m= 1 


( 6 ) 


These particles are first propagated to the next distribution using a Markov kernel @t) to obtain 

the set of particles 1 0^ X . Importance Sampling (IS) is then used to correct for the discrepancy between 

1 J m= 1 
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the sampling distribution r)t(0i : t) defined as 




and 7Tf(0i:i). In this case the new expression for the unnormalized importance weights is given by 


*-> c . *.tO . , oM 

' ^ { ‘- 1 " ’ 


j(m) 


W?"*' a 


where wy, termed the (unnormalized) incremental weights, are calculated as, 

/p(m) n('>n)\ _ lt\Pt i“t— l) 


MonA , ) = 


u “t l 


However, as in the particle filter, since the discrepancy between the target distribution fry and the proposal 
r]t increases with t, the variance of the unnormalized importance weights tends therefore to increase as well, 
leading to a degeneracy of the particle approximation. A common criterion used in practice to check this 
problem is the effective sample size ESS which can be computed by: 


E oE 


(m)x2 _ \m= 


\m= 1 _ / 

e (w^\y (wt^ei^y 


If the degeneracy is too high, i.e., the ESS/ is below a prespecified threshold, ESS, then a resampling step is 
performed. The particles with low weights are discarded whereas particles with high weights are duplicated. 
After resampling, the particles arc equally weighted. 

Let us mention two interesting estimates from SMC samplers. Firstly, since fry admits 7r* as marginals by 
construction, for any 1 < t < T , the SMC sampler provides an estimate of this distribution 

N 

tt ?(dO) = J2w t {m) s e ^(de), HD 

m= 1 

and an estimate of any expectations of some integrable function <p(-) with respect to this distribution given 
by 

N 

K?[vm = J2 w t m) M m) )- (i2) 

m= 1 
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Secondly, the estimated ratio of normalizing constants —— 

1 


J7t(0)d0 

f 7t_i (0)d0 


is given by 


_Zt_ 

Z t -i 


N 

m= 1 


(13) 


Z t 

Consequently, the estimate of — is 

■Z’l 


Zt_ 

Zi 


t 


n 


Zk 

Zk -1 


t TV 

n £ w£lw,(o£\,e<rd 


k =2 m=l 


(14) 


If the resampling scheme used is unbiased, then (fT4l) is also unbiased whatever the number of particles used 
[17], Moreover, the complexity of this algorithm is O(N) per time step and it can be easily parallelized. 

Finally, let us note that there exists a few other SMC methods appropriate for static inference such as 
annealed importance sampling [18], the sequential particle filter of [19] and population Monte Carlo [20] but 
all of these methods can be regarded as a special case of the SMC sampler framework. 


B. On the choice of the sequence of target distributions and mutation/backward kernels 

The algorithm presented in the previous subsection is very general. There is a wide range of possible 
choices to consider when designing an SMC sampler algorithm, the appropriate sequence of distributions 
{ttt}i<t<T, the choice of both the mutation kernel {/Q \ : 2<t<T and the backward mutation kernel {£/_ i } J =2 
(for a given mutation kernels), see details in [8]—[10]. In this subsection, we provide a discussion on how to 
choose these parameters of the algorithm in practice. 

1) Sequence of distributions 77 ; There are many potential choices for {77 } leading to various integration 
and optimization algorithms. As a special case, we can set 77 = 7r for all t £ Af. Alternatively, to maximize 
7r(0), we could consider n t (0t) = [it(0t)]^ for an increasing schedule to ensure ttt(0) is concentrated 

around the set of global maxima of Tt{0). In the context of Bayesian inference for static parameters which is 
the main focus of this paper, one can consider 77 ( 0 ) = p(0\y\ ■ • • • , yt), which corresponds to data tempered 
schedule. 

In this paper, we are interested in the likelihood tempered target sequence, that has been proposed in [18], 

n(0) = a p(0)p{y\0)^, (15) 

where {cj) t } is a non-decreasing temperature schedule with fo = 0 and fr = 1. We thus sample initially 
from the prior distribution 7To = p(0) directly and introduce the effect of the likelihood gradually in order to 
obtain at the end t = T an approximation of the posterior distribution p(0 |y). As discussed in [18], tempering 
the likelihood could significantly improve the exploration of the state space in complex multimodal posterior 
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distribution. From Eq. d, the normalizing constant of the posterior target distribution which corresponds 
to the marginal likelihood, p( y), can be approximated with SMC samplers as: 

T y T N 

z ?=n »n e wSwM-i 4 m> ) do 

t =2 ZJt 1 t =2 m= 1 

where Z t = f p(y\0y hl p(0)d0 corresponds to the normalizing constant of the target distribution at iteration 
t (thus Z\ = f p(0)d0 = 1). The approximation of an expectation with respect to the posterior is given by: 

N 

E ^[p{e)] = Y J w T^T ) )- a?) 

2=1 

2) Sequence of mutation kernels JC t : The performance of SMC samplers depends heavily upon the selec¬ 
tion of the transition kernels {/Q}^1 2 and the auxiliary backward kernels {Ct-i}j 2 . There are many possible 
choices for /C/ which have been discussed in [8]—[ 10]. In this study, we propose to employ MCMC kernels 
of invariant distribution ir t for /Q. This is an attractive strategy since we can use the vast literature on the 
design of efficient MCMC algorithms to build a good importance distributions (See [1]). 

More precisely, since we are interested in complex models with potentially high-dimensional and mul¬ 
timodal posterior distribution, a series of Metropolis-within-Gibbs kernels allowing local moves will be 
employed in order to successively move the B sub-blocks of the state of interest, 6 = [, Qi, Q 2 , ■■■ ,Qb\- 
A random walk proposal distribution is used for each sub-block with a multivariate Gaussian distribution as 
proposal: 

6b,t = 6b,t -1 + £b,ti ( 18 ) 


in which is a Gaussian random variable with zero mean and covariance matrix S/, ( . As with any sampling 
algorithm, faster mixing does not harm performance and in some cases will considerably improve it. In the 
particular case of Metropolis-Hastings kernels, the mixing speed relies on adequate proposal scales. As a 
consequence, we adopt the strategy proposed in [21]. The authors applied an idea used within adaptive 
MCMC methods [22] to SMC samplers by using the variance of the parameters estimated from its particle 
system approximation as the proposal scale for the next iteration, i.e., the covariance matrix of the random- 
walk move for the 6-th sub-block at time t is given by: 


N 


= E W'S (4”h - mm-i) (4T, 


m= 1 




(19) 


N 


with 


E jjr(m) (m) 
^t-i 6b,t -1 


m=l 


The motivation is that if TTt-i is close to 7 q (which is recommended for having an efficient SMC algorithm), 
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then the variance estimated at iteration t — 1 will provide a sensible scaling at time t. This adaptive Metropolis 
within Gibbs used in the implementation of the SMC sampler through this paper is summarized in Algorithm 

m 


In difficult problems, other approaches could be added in order to have appropriate scaling adaptation; 
one approach demonstrated in [21] is to simply employ a pair of acceptance rate thresholds and to alter 
the proposal scale given by Eq. (fl9l) whenever the acceptance rate falls outside those threshold values. This 
scheme is to ensure that the acceptance rates in the Metropolis-Hastings steps do not get too large or small. 
Through all this paper, we use this procedure which consists for example to multiply the covariance matrix 
by 5 (resp. 1/5) if the rate exceeded 0.7 (resp. fell below 0.2). 

3) Sequence of backward kernels C ti : The backward kernel Ct. is arbitrary, however as discussed in [8]— 
[10], it should be optimized with respect to mutation kernel /Q to obtain good performance. [8]—[10] establish 
that the backward kernel which minimize the variance of the unnormalized importance weights, Wt, are given 
by 


<?*_!) = 


m(0t) 


( 20 ) 


However, as discussed in Section III-A1 it is typically impossible to use these optimal kernels as they rely 
on marginal distributions of the joint proposal distribution defined in Eq. £7]) which do not admit any closed 
form expression, especially if an MCMC kernel is used as /Q which is ^/-invariant distribution. Thus we can 
either choose to approximate £° pt or choose kernels Ct so that the importance weights are easily calculated 
or have a familiar form. As discussed in [8], [9], if an MCMC kernel is used as forward kernel, the following 
Ct is employed 


&t- 1 ) — 


@t) 


( 21 ) 


71t(0t) 

which is a good approximation of the optimal backward if the discrepancy between Tr t and tt/,_ i is small; 
note that d2lT) is the reversal Markov kernel associated with K',/. In this case, the unnormalized incremental 
weights becomes 




7.-1 (»&’) 




( 22 ) 


This expression (1221) is remarkably easy to compute and valid regardless of the MCMC kernel adopted. Note 
that ft — ft- 1 is the step length of the cooling schedule of the likelihood at time t. As we choose this step 
larger, the discrepancy between tt/ and 7T/._i increases, leading to an increase as the variance of the importance 
approximation. Thus, it is important to construct a smooth sequence of distributions {tt/ } u<t<T by judicious 
choice of an associated real sequence {ft}J = 0 - 

Let us remark that when such backward kernel is used, the unnormalized incremental weights in Eq. (l22l) 


April 23, 2015 


DRAFT 








10 


at time t does not depend on the particle value at time t but just on the previous particle set. In such a case, 
the particles | should be sampled after the weights j wj m> j have been computed and after the particle 

approximation | | has possibly been resampled. 

Based on these discussions regarding the different possible choices, the SMC sampler that will be used for 
Bayesian inference through this paper is summarized in Algorithm [Q 


Algorithm 1 SMC Sampler Algorithm 
1: Initialize particle system 

2: Sample j d[ rr ^ 1 ~ pi(-) an d compute 

l J m—1 

ESS 



^ N 7i(0, O) ) 

\M^)) 



and do resampling if ESS < 


3: for t = 2,..., T do 

4: Computation of the weights: for each m = 1 ,N 


Normalization of the weights : wj m> 


W t {m) = 
= W t (m) 


E U W ^V 


5: Selection: if ESS < ESS then Resample 

6: Mutation: for each m = 1,..., N : Sample 0™ ~ K-t (d[ '"I ; •) where /C*( •; •) is a : r* (■) invariant Markov kernel 

described in more details in Algo. [2] 

7: end for 


Algorithm 2 Adaptive Metropolis-within-Gibbs Kernel /Q(-; •) for the m-th particle 

1: Initialization Set 0° = [g ?,..., g° B ] = 0^ = , g { B }_ J 

2: for i = 1,..., Amcmc do 
3 : for b = 1 ,..., B do 

4: Sample qI ^ A /" (^ _1 , with defined in Eg. fl9l 

5: Compute the Acceptance ratio: 


6 

7 

8 
9 

10 

11 

12 

13 

14 


, * i_i, . r p{y\e*)^ P (e*) 

a(g h ,B h ) = mm <1, — . .. , 


with 0* = [g\,..., Ql_ l: Ql , g ’ h+ \,..., and Q i 1 = [g\,. 

Sample random variate u from /.AO. 1) 
if u < a(0*, 0 i_1 ) then 

el = et 

else 

el = er 1 

end if 
end for 
end for 

Set the new particle value at time t as = [^j v “ CMC ,..., £»^ MCMC ] 


• 6b , ^b+ii ■ ■ • j 
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C. Asymptotic analysis of the SMC sampler-based estimator 

In the remainder of this section, we are interested in the convergence results of SMC Samplers. We aim at 
analyzing the influence of the choice of the sequence of intermediate target distributions in the performance of 
the SMC sampler when resampling is performed before the sampling step. As discussed in the previous section, 
one of the main attractive properties of the SMC sampler is to be able to use some local moves (using an 
MCMC kernel) in order to draw the particles at the next iteration. Such local moves are particularly interesting 
when the state of interest is high-dimensional. As discussed in Section III-A1 when such an MCMC kernel is 
used as a forward kernel in the SMC sampler, /Q, the backward kernel used in order to be able to compute 
the incremental weight is given in Eq. (I2T1 ). Moreover, in order to obtain convergence results that are easy to 
analyze and utilize, we assume that the MCMC kernel used is perfectly mixing, i.e. JCt(0t-i,0t) = (6fj. 

Under these two assumptions, we derive the asymptotic variance of the SMC sampler estimates when 
resampling is performed before the sampling stage at each iteration. 

Proposition 1: Under perfect mixing assumption and if the backward kernel given in Eq. m is used, we 
obtain the following results: 

1) For the expectation estimator: 

N = {E-n-Jv ((f) — E nt (ip )} ^ ff(0,o 2 SMC}t (p)) (23) 

with 

*SMcM = - ^>(0))} = Vmv t M0)) (24) 


2) For the normalizing constant estimator: 


N* 




(25) 


with 


a SMC,t 


At (Oi) 

m(0i) 


d0 1 



it- 1) 


(26) 


Proof: See Appendix. ■ 

Remark 1: As expected, we can conclude from these results that even if a perfect mixing MCMC kernel is 
used, the variance of the estimator associated with the normalizing constant in Eq. (l26l) still depends on all 
the sequence of target distributions as a cumulative sum of the discrepancy between two consecutive target 
distributions. 

In the next section, we will use this result in order to design an automatic procedure for the selection of 
the sequence of target distributions and more especially the evolution of the cooling schedule that defines 
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completely this sequence in Eq. (fl5l) . 


III. Adaptive Sequence of Target Distributions 

A. Existing approaches 

Several statistical approaches have been proposed in order to automatically obtain such a schedule via the 
optimization of some criteria, which are known as on-line schemes. [21] proposed an adaptive selection method 
based on controlling the rate of the effective sample size (ESS*), defined in (flQl ). This scheme thus provides 
an automatic method to obtain the tempering schedule such that the ESS decays in a regular predefined way. 
However, one major drawback of such an approach is that the ESSt of the current sample weights corresponds 
to some empirical measure of the accumulated discrepancy between the proposal and the target distribution 
since the last resampling time. As a consequence, it does not really represent the dissimilarity between each 
pair of successive distributions unless resampling is conducted after every iteration. 

In order to handle this problem, [23] proposed a slight modification of the ESS, named the conditional 
ESS (CESS), by considering how good an importance sampling proposal 1 would be for the estimation 
of expectation under 7r*. At the t-th iteration, this quantity is defined as follows: 


: t — 


N 


T, nw !-i 


w. 


(0 


Ef=i NW& 


2=1 


oo „„eo 


w 


\ 2 1 

- 1 C 

j 

E 


Etl «?>: 


(<> ..,(<> 


2-/7=i n vv t- 1 1 


w. 


U)\ 2 ' 


(27) 


Nevertheless, by using either ESS or CESS criterion, the number of steps T of the SMC samplers 
completely depends on the complexity of the integration problem at hand and is not known in advance. 
In other words, for either fixed ESS* or fixed CESS*, the associated sequence {(j>t\J = \ is an on-line self¬ 
tuning parameter. Smaller values significantly speed up the Sequential Monte Carlo algorithm but lead to a 
higher variation in the results. Consequently, we are not able to control the total complexity of the algorithm, 
and it is typically impossible to obtain the comprehensive view of the behavior of the cooling schedule {(j) t } 
a priori, instead one has to wait until the algorithm is completed. 


B. Proposed adaptive cooling strategy 

In this paper, we propose an alternative strategy to choose the sequence of target distributions adaptively to 
the problem under study. In particular, we propose to consider the sequence of distributions which minimizes 
the variance of the particle approximation of the normalizing constant derived previously in Eq. (l26l) . This 
strategy is thus based on a global optimization of cooling schedule {4> t } which enable us to control the 
complexity of the algorithm by determining before any simulation the number of SMC iterations T. In this 
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way we obtain what will be referred to as an off-line scheme, and we will obtain the complete view of the 
cooling schedule performance before running the SMC sampler. 

1) Objective function and Optimization procedure: By carrying out our criterion, we have to find T — 1 
positive step lengths g = {gt}J_ L 2 > defined as (f—jh-i such that Qt = 1* which minimize the asymptotic 
variance given in Eq. (l26l) . Here, we are aiming at finding 


Q = {f?2, • • •, Qt} 


arg min 

{Q2 



(T-l) 


T 

subject to E Qt = 1 and Vm = 2,..., T : Q m > 0 

t= 2 


(28) 


where 




p(y\6) ( l )t p(d) 
f p(y\0) ( f >t p(d)dd 


p(y\6)^p(e) 

Z t 


t 

with 4> t = Q rn . 
m =2 


(29) 


Equation d28l) involves T — 1 integrals and each integral represents, as discussed in Section III-C1 a 
dissimilarity measure between each pair of successive distributions. The main difficulty in carrying out this 
construction is that these integrals arc generally intractable, typically requiring approximation. 

In order to avoid the use of numerical methods to approximate the T — 1 integrals which could be very 
challenging to do if 6 is high-dimensional, we propose instead to approximate each target distribution 7Tf(0) 
by a multivariate normal distribution. Indeed, from the connection between these integrals and the Renyi 
divergence between two distributions, an analytical expression for the asymptotic variance to minimize can 
be obtained by using the following result [24]: 

For Gaussian multivariate distribution fi = A r (p \. E ]) and fi = J\f(p 2 , £ 2 ) we h ave 

det (aYi 2 + (1 — a)£i)~ 5 

det(£i)V- det(E 2 ) _ = qq) 

x ex P | ^—— {qi ~ /, 2 ) T («E 2 + (1 - a)Si) _1 (Mi - A* 2 )| 

which is finite iff aEjj 1 + (1 — a)'Ef 1 is positive definite. 

Finally, a nonlinear optimization technique, such as for example the Nelder-Mead algorithm [25], can be 
used to solve this optimization in order to obtain the value g. 

2 ) Normal approximations of intermediate target distributions: In order to find the value g that minimizes 
the asymptotic variance of the estimate of the normalizing constant, we need to approximate the T intermediate 
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target distributions, n t for t = 1, ■ • • , T by multivariate normal distributions, i.e.,: 

MO) OC p(y\0)*<p(0) « E t ). (31) 

In order to reduce the complexity associated with these T different normal approximations of the intermediate 
target distributions (which consists in finding both T mean vectors {/q } J =] and covariance matrices {E t\t =1 ), 
we propose to only approximate the prior and the posterior p{6 |y) distribution and thus deduce all normal 
approximation required by using the convenient properties of the normal distribution. 

Indeed, approximating both the prior and the posterior by normal distributions with parameters (/z p , E p ) 
and (fJ-T, Er) respectively, leads to a normal likelihood approximation with 

S/ = (E^ 1 - E' 1 ) -1 , 

(32) 

G'l = S l (E rp HT S p /Xp) . 

Moreover, since a tempered normal is proportional to a normal with only a modification of the covariance 
and also the product of 2 multivariate normals is a multivariate normal distribution, the t- th target distribution 
can therefore be approximated by : 

M0)~M(0\tM,V t ), (33) 


with 


S* = (E- 1 + ^E” 1 ) , 

[J-t = E t (E p 1 fi p + 0tE ; 1 Hi) . 


(34) 


Only the prior and the posterior require normal approximations which can be performed using either 
Laplace’s method [26] (which requires to be able to compute the first and second derivatives) or a simulation- 
based moment matching technique (e.g., using random draws from a simple importance sampler). 


IV. Scheme for Recycling all past simulated particles 

After having proposed, in the previous section, a strategy in order to automatically specify the sequence 
of target distributions that will reduce the asymptotic variance of the estimator of the normalizing constant, 
we now focus on some strategies to improve the estimator of an expectation with respect to 7r(-). 

Using SMC samplers, this quantity is typically approximated with Eq. ( fT71) . as: 

r N ~ 

J = M M0)\ = I TT(6MG)dO « E^S MO)} = J2 w£V(4°), ( 35 ) 

i =1 

since 7 Tt(-) = vr(-). Only the samples from the iterations targeting the true posterior (generally only the last 
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one) are taking into account for the approximation of the expectation. In this paper, in order to reduce the 
variance associated with this estimator in Eq. (l35l) . we propose two different strategies that will use particles 
drawn at the previous iterations of the sampler modified under a “recycling principle”. We remark that these 
two recycling schemes are performed once the T iterations of the SMC sampler are finished. 

A. Recycling based on Effective Sample Size 

As discussed above, the SMC approximation of the posterior expectation is typically only based on the 
samples from the last SMC iteration i.e. from ttt- In order to have a more efficient estimation in the sense 
of minimizing the variance of the estimator in Eq. (l35l) . the idea we propose to explore in this section is 
to recycle all the particles that have been generated through the T iterations of the SMC sampler. This is 
challenging as intermediate samples from the sequence of distributions {TTt}J=i do not target directly the 
posterior of interest ttt- 

In [27], a strategy has been proposed in order to recycle all the elements of the Markov chain obtained from 
a simulated tempering based MCMC algorithm. Here, we propose to adapt this approach to the T collections 
of weighted samples given at each iteration of the SMC sampler. The idea is to correct each of these T set 
of weighted random samples by using an importance sampling identity to adjust these samples which are not 
drawn from the distribution of interest 7r(-). 

More specifically, at the end of the t-th iteration of the SMC sampler, the weighted particle system 
approximates the target distribution 7 rt(-) as follows: 

N 

W®6 g u(M). (36) 

i= 1 

However, in order to be able to use importance sampling identity, we need to have a set of unweighted samples 
from 7 Tt(0). For this purpose, an unbiased resampling step that consists in selecting particles according to their 
importance weights can be used [28]. With a multinomial resampling scheme, we obtain a new collection 

R ?:) y V ~7r t (0), (37) 

r J i= 1 

where for i = 1,..., N 

ef ] = e[ Jl) with 4 ~ M (w t {1) , • ■ •, . (38) 

Let us remark that if the resampling stage has been already performed at a specific iteration of the SMC 
sampler, the previous described steps are not necessary since the obtained samples arc already asymptotically 
drawn from the target distribution 7r t (-) (in this case, we set directly 6^ = o\ l> for i = 1,..., N). At the end 
of the SMC sampler, we have T collections of random samples drawn from each distribution of the targeted 
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sequence. 

Since we know the distribution from which these random samples 1 6) j are sampled, an estimate of 
the expectation in (l35l) can be obtained by using an importance sampling identity: 


N 

' h t = J2 


,(«?>) 


j= 1 Si=l W ESS,t(Ot 


w 


with 


(f)\ _ 7(^ f 




WEsS,t{@t ) — 


(39) 


(40) 


iM j) ) 

where 7 (-) and (•) are the unnormalized target distribution at the final iteration (i.e., the posterior) and at 
the t-th iteration, respectively. 

Finally, an overall estimator that will take into account all these estimators (or potentially a subset 17 among 
these T estimates) can be obtained as follows: 


h — ^ ' Xtht , 
ten 


(41) 


where 0 < X t < = 1. 

As discussed in [27], the combination coefficients X t have to be chosen carefully if we do not want to have 
the variance of the estimator PT1) smaller than the one without recycling given in Eq. (1351) . For example, a 
tempting solution is to take for t € 17: 


Xt — 


w, 


ESS ,t 


w K 


(42) 


with W ESS , t = Ylf=i w ^ss an< 3 Wess = ^Ess,t hut this can lead to very poor performance of 

the resulting estimator as illustrated empirically in the numerical simulation section in which we denote this 
choice by the “naive” recycling scheme. The solution proposed by [27] is thus to find all the A t that maximizes 
the effective sample size of the weights of the entire population of particles. By combining Eqs. (HTl) and 
(|39T) . we can write: 


N 


h - Xt 

ten j= t 




W, 


(43) 


ESS,t 


The effective sample size of the entire population can then be defined as follows: 


ESS(A^)= 



W ESS,t(@, 


'(j)^ 


-\ -1 


w r 


ESS ,t 


(44) 
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As a consequence, the set of coefficient that maximize this effective sample size is the same as 


N ( nn (fitihV 

\* ■ ST' I \ W ESS,t\Vt ) \ 

Aq = arg mm ^ ^ A t - 1 

As2 ten j =1 \ 


W f 


ESS,t 


(45) 


subject to X t = 1. 

ten 


By using Lagrangian multipliers, the optimal A* within the SMC sampler framework are defined, for t € Q, 
by: 


It 


^ F l 

zL men 


with L = 


W 2 i 

tv ESS ,t 


Ell 


(46) 


Let us remark that the value l t involved in this optimal coefficients X* corresponds to the effective sample 
size of the t- th collection of importance weights given in (l40l ) and as a consequence 1 < It < Nt. 


B. Recycling based on Deterministic Mixture Weights 


The previous solution is based on the combination of local estimators obtained by the collection of weighted 
particles from every iterations of the algorithm. In this section, we propose a new strategy that combines 
individual weighted particles by correcting their importance weights. This second scheme we propose is based 
on the principle, called the deterministic mixture weight estimator proposed as in [29] and discussed by Owen 
and Zhou in [30]. 

This approach has been derived in order to combine weighted samples obtained from different proposal 
distributions in the importance sampler framework. More recently, this technique has also been used in the 
Adaptive Multiple Importance Sampling (AMIS) of [31] in order to recycle all past simulated particles in 
order to improve the adaptivity and variance of the Population Monte Carlo algorithm [20]. We propose to 
adapt this technique to the framework of the SMC sampler. 

As discussed in [30], using a deterministic mixture as a representation of the production of the simulated 
samples has the potential to exploit the most efficient proposals in the sequence 771 ( 8 ),..., //r( 8 ) without 
rejecting any simulated value nor sample, while reducing the variance of the corresponding estimators. 
The poorly performing proposal functions are simply eliminated through the reduction of their weights and 
therefore their influence in: 


vr (8 


(*h 


En=l c nt]n 


(i)^ 


(47) 


as T increases (with c n = N r J YlJ= 1 M is the proportion of particles drawn from the proposal // r ,Li Indeed, 
if 771 is the poorly performing proposal, while the r] n ’s (n > 1 ) are good approximations of the target n, for 


'Here we assume the general case in which a different number of particles could be drawn at each iteration of the SMC sampler. 
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a value 6^ such that Tr(d[ t ' > )/r]i(6 { i l] ) is large, because r/i(0 [is small (and not because it is a sample with 
high posterior value), 7r(0^)/{ci7/i(0^) + ... + ct 17 t(&i ' > )} will behave like 7r(#{ , ' ) )/{c 2 r] 2 {0^) + ... + 
ctVt{6 f'*)} and will decrease to zero as T increases. 

In our case, since we are not in the importance sampling framework with well defined proposal distribution 
but instead with T collections of samples from the intermediate target distributions ^ j ,..., j Oj) j 
by following the same resampling step as described in the previous section in Eq. (l38l) . the estimator of an 
expectation using this proposed deterministic mixture will be given by: 

T N t 




(i) 

™DeMi 


t= 1 i= 1 l2k= 1 l2i =1 ^DeM 


ip(0 




(48) 


with 


(0 

W D eM,x,t = 


tt(6 i 




ELlVn(8|' ) )’ 


(49) 


where c r , = N n / YlJ=\ M is the proportion of particles drawn from n n amongst all the simulated particles. 
The problem with this strategy is we need to evaluate the target ?r/ (•) exactly (not up to a constant) and thus we 
need to know the normalizing constant Z t involved in all the intermediate target distributions tt/(•) = 7 t (-)/Z t . 
Hence, at his point the idea we propose is to use the (unbiased) SMC approximation of each normalizing 
constant given by Eq. (fl4l) . As a consequence, the weights of this proposed recycling scheme, defined originally 
in Eq (l49l ). is thus approximated by: 


(i) 




Z-^n= 


71=1 ' 


, ln (ef)Zn V 


(50) 


V. Numerical Simulations 


In this section, the performances of the proposed strategies used to improve the SMC sampler-based 
estimators are assessed through two different models and also for a class of models used widely in signal 
processing, namely penalized regression. In the remainder of this paper, we adopt a parametric form for 
the temperature schedule: <p t = h(t: 7 , T), which satisfies the following conditions: is non-decreasing 

function, = 0 and d>T = 1- This efficiently reduces the optimization problem to a univariate problem of 
finding the optimal value for an unique parameter 7 instead of T — 1 parameters {gt}J= 2 - The parametric 
function used for the proposed adaptive cooling schedule strategy is defined as: 


exp (7 t/T) — 1 
exp( 7 ) - 1 


(51) 
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A. Model 1: Linear and Gaussian Model 

Let us firstly consider a linear and Gaussian model for which the a posteriori distribution as well as the 
marginal likelihood can be derived analytically. The proposed strategies can thus be compared with the optimal 
Bayesian inference method. More precisely, we assume 


p{9)=N{9 |/*,S), 

p{y\0)=N{y\H6,'Z y ). 

For this model, the posterior distribution is given by p(6 |y) = AT(0\n p , S p ) with 

Li P = li + VH t {HVH t + S y ) _1 [y - H/j] , 

= (i ne - y,h t (hvh t + Sp) _1 h\ s. 

In addition, the marginal likelihood (i.e. the normalizing constant) is also know in closed form: 

p(y)=Af(y\Hn,HVH T + V y ). 


(52) 


(53) 


(54) 


For illustration, we select £ = 10-Z"io and p, = Oioxi for the prior distribution. Concerning the likelihood 
parameters, all the elements of the transition matrix have been randomly generated using a standard normal 
distribution and = I riy with a varying number of observations n y . Regarding to the SMC sampler, and in 
particular for the adaptive MWG (summarized in Algo. [2]), we use for the forward kernel: Amcmc = 5 and 
B = 5. 

1) Analysis of the proposed adaptive cooling schedule: Under this model, the proposed approach we 
develop is optimal (in the sense of minimizing the asymptotic variance of the normalizing constant), since 
each intermediate target distribution is a multivariate normal distribution. In Fig. Q] the evolution of the 
theoretical asymptotic variance of the normalizing constant estimator defined in Eq. (1261 ) with the parameter 
value 7 is depicted for the SMC sampler as a function of the number of iterations T and for different number 
of observations. We can clearly see that an optimal value exists for the parametric function of the cooling 
schedule, that will minimize the asymptotic variance. 

In Fig. |3 we compare the theoretical asymptotic variances of the normalizing constant with the ones 
obtained by simulation. In order to obtain these results, we have run 500 times an SMC sampler that utilizes 
a perfect mixing forward kernel which can be straightforwardly obtained analytically for this specific model, 
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i.e.: 


K t (0 t -i,9t) = n(0 t ) (xp(y\d)^p(0) 

= Af(e\» t , E t ), 


with 


IH = l* + ( HEH 1 + — E y ) [y - Hfi] , 


( 55 ) 


(56) 


E t = | I no - EH 1 ( HEH t + —E 


-l 


H E. 


(57) 


From Fig. [3 we can see that the variance of the normalizing constant estimator for the SMC sampler with 
a finite number of particles is very close to the theoretical asymptotic value. Indeed, only few particles are 
required to reach these asymptotic variances under this model. 




(a) 20 Observations 


(b) 30 Observations 


Fig. 1: Evolution of the theoretical asymptotic variance of the SMC sampler estimate of the normalizing 
constant versus the value of 7 in the cooling schedule as the function of the numbers of iterations for 
different number of observations 


Fig. |3] compares the proposed methodology for the optimal adaptive cooling schedule versus two others 
competitors, the one based on the CESS of [23] and the 1 inear cooling schedule often used in practice. 
Performances of the SMC samplers are illustrated for two choices of mutation kernels: the perfect mixing 
kernel (Eq. 1551) or the adaptive random walk Metropolis Hastings kernel. These results clearly show the 
benefit of using such adaptive cooling schedule - a bad choice can lead to a very poor estimate which has 
a large variance. Variance results obtained from the proposed approach and the CESS-based approach are 
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(a) 20 Observations (b) 30 Observations 

Fig. 2: Comparison of the theoretical asymptotic variances (dashed lines) and the empirical ones from SMC 
sampler using perfect mixing Markov Kernel (solid lines) by using the optimal value of the parameter 7 with 
20 Observations. 


comparable. The main advantage of our proposed approach is that we control the global complexity of the 
SMC sampler since we set the number of iterations of the SMC sampler whereas in the CESS-bascd strategy, 
the number of iterations of the SMC samplers will depend on the problem under consideration as well as the 
predefined value of CESS. In order to be able to compare both approaches with the same complexity, several 
runs of the SMC sampler with different values of the CESS have been performed to obtain the CESS value 
that roughly leads to a specific number of iterations T (25, 50 and 100). 

2) Analysis of the proposed recycling schemes: We finally assess the performance of the two proposed 
recycling schemes. In order to analyze the potential gain of recycling past simulated particles, four different 
estimators based on the output of the SMC sampler are compared: “no recycling” given in Eq. (1351) . “Naive 
recycling ” and “ESS-based recycling ” given in Eq. (l41~b with At defined respectively in Eq (l42l) and(|46l) and 
the “DeMix-based recycling” described in Section HV-BI 

Fig. 0] shows the mean squared error (MSE) between the estimated posterior mean and the true posterior 
mean given by fi p in Eq. ( l53l ). We can firstly remark from these results that the naive recycling scheme 
does not really improve the performance of the estimator of the posterior mean. On the contrary, both our 
proposed schemes outperform significantly this naive recycling and the classical estimator that uses only the 
final population of particles (No recycling scheme). The improvement increases with the number of iterations 
used in the SMC sampler, as expected since more collection of particles can be recycled in the estimator. 
These results demonstrate also empirically for this model the superiority of the DeMix recycling approach. 


April 23, 2015 


DRAFT 


















22 




Average Number of iterations T 


(a) Perfect Mixing Kernel 


(b) Adaptive MWG kernel 


Fig. 3: Comparison of the different cooling schedule strategies in terms of the variance of the normalizing 
constant estimate for different number of particles. Results are obtained with the use of either the perfect 
mixing Kernel (left) or the adaptive MWG kernel (right). 




(a) 20 Obs. (b) 40 Obs. 

Fig. 4: Mean square error between the estimated and the true posterior mean for Model 1 using the different 
recycling schemes 
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B. Model 2: Multivariate Student’s t Likelihood 


In this second example, we illustrate the results of our optimal schedule and recycling methods with a 
multivariate Student’s t distribution as likelihood: 


p(d) = M(0\II,v 

p(y\o ) = 




[y - Hd} T 'Z- 1 [y - HO] 


(V + Tly ) 


1 + 


T (§) (un) n y / 2 

This model can be particularly challenging due to the possibility of having a multimodal target posterior 
when contradictory observations are used. To analyze the performance of the proposed scheme in complex 
situation, we use 


H = 


-i T 


110 0 
0 0 11 


l T 


y 1 V2 2/3 2/4 


l T 


S = 20 / 2 , M = C> 2 xi> 5 ]1 = O.IZ 4 and we observe a vector y = 

These particular choices lead to an highly multimodal posterior distribution as illustrated in Fig. [5] for two 
different values of the degree of freedom of the multivariate Student’s t likelihood. From this model and the 
parameters used, we have for both values of the degree of freedom [ 0 ] = [0 0 ] 7 , which is confirmed by 
the numerical evaluation of the posterior shown in Fig. [5] For this model, we will follow the same procedure 
as in the previous Model: analysis of proposed adaptive cooling schedule and then of the proposed recycling 
schemes. In all the numerical simulations presented in this section, we have chosen Nmcmc = 10 and B = 2 
as parameters of the adaptive MWG kernel within the SMC sampler. 

Table U shows the variance of the estimated normalizing constant (i.e., p( y)) when the degree of freedom of 
the multivariate Student’s t distribution is v = 0.2 and v = 7, respectively. We compare the results obtained 
using different cooling schedules. The proposed adaptive approach, the CESS-based one as well as the linear 
cooling schedule yield similar results. From our simulation results, we can see that the proposed adaptive 
procedure takes a very small value (close to 0 ) as optimal value for 7 which thus leads to the linear cooling 
schedule. Same remark when we analyze the evolution of the temperature given by the “on-line” CES 8 - 
based strategy. Nevertheless, we can see that these variances can degrade very significantly if another value 
of 7 is chosen (here we take 7 = 6 ). This clearly demonstrates the impact of this temperature schedule in 
terms of the variance of the normalizing constant. The proposed procedure is thus of great interest in order 
to automatically decide what should be the evolution of this cooling schedule for a given number of SMC 
iterations. 

Fig. shows the mean squared error between the estimated posterior mean from the proposed recycling 
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Fig. 5: Target posterior distribution p(6 |y) in log scale evaluated on a grid with 2 different values for the 
degree of freedom of the Student’s t likelihood 



Linear 

Cooling 

CESS 

Proposed 


Cooling 

7 = 

6 

Approach 

Approach 


7 —» 0 








v = 0.2 

v = 7 

v = 0.2 

vm7 

1/3 0.2 

v=7 

v = 0.2 

v = 7 

N = 50 

0.0026 

0.0146 

0.0110 

0.0375 

0.0028 

0.0177 

0.0030 

0.0209 

T =25 Iter. N = 100 

0.0013 

0.0086 

0.0055 

0.0152 

0.0012 

0.0079 

0.0015 

0.0088 

N = 200 

0.0006 

0.0050 

0.0024 

0.0090 

0.0007 

0.0041 

0.0008 

0.0042 

N = 50 

0.0011 

0.0105 

0.0046 

0.0160 

0.0016 

0.0078 

0.0013 

0.0072 

T =50 Iter. N = 100 

0.0007 

0.0050 

0.0029 

0.0102 

0.0006 

0.0037 

0.0006 

0.0039 

N = 200 

0.0003 

0.0028 

0.0012 

0.0043 

0.0004 

0.0025 

0.0004 

0.0017 

o 

lO 

II 

0.0006 

0.0047 

0.0026 

0.0078 

0.0006 

0.0051 

0.0009 

0.0037 

T =100 Iter. A = 100 

0.0003 

0.0022 

0.0013 

0.0044 

0.0003 

0.0022 

0.0004 

0.0023 

N = 200 

0.0002 

0.0016 

0.0005 

0.0026 

0.0002 

0.0010 

0.0002 

0.0013 


TABLE I: Comparison of the variance of the normalizing constant estimator obtained by using different 
cooling schedules for Model 2 with u = 0.2 and v = 7. 


scheme and the true one. Unlike the previous model (linear and Gaussian one) for v = 0.2, the naive recycling 
outperforms the classical estimator of the SMC sampler when only the last collection of particles is used. 
This could be explained by the shape of the target posterior (Fig. [5]). Indeed, in such a case, the posterior has 
a large region with a non-zero probability in the middle of the “square”. As a consequence, the particles of 
the first iteration of the SMC sampler that target the prior can be very useful. However, when the degree of 
freedom of the likelihood is high (u = 7), this remark does not hold since the posterior is really concentrated 
on 4 modes. From this case, it is also interesting to see that the MSE increases with the number of iterations 
used in the SMC sampler when either no recycling or naive recycling is performed. Indeed, by increasing the 
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number of iterations, we increase also the number of potential resampling steps and we know that during the 
resampling procedure, some particles which are currently located in one of the 4 modes can be discarded. 
Therefore it becomes very difficult for the SMC sampler to jump between two well separated modes, thus 
leading to an unexplored mode by the SMC sampler for the next iteration. This effect does not appear with 
the proposed recycling scheme since we recycle all the past simulated particles. 




Fig. 6 : Mean squared error between the estimated and the true posterior mean for Model 2 using the different 
recycling schemes with T =100 Iterations. 


Finally, in order to emphasize the significant gain that could be obtained using our proposed recycling 
schemes, Table El shows the mean and standard deviation of the Kolmogorov-Smirnov distance defined as 
D = sup 0 i \F n (0i) — F{0\) where F N and F are the empirical cumulative distribution obtained from 
the SMC sampler and the true posterior cumulative distribution, respectively. This distance D is obtained 
through 100 runs of the SMC samplers. Compared to the previous comparisons related to the MSE of the 
posterior mean, this measure give us some information about the quality of the approximation of the whole 
target distribution. In order to obtain these results, the true target cumulative distribution F(0 \) has been 
obtained numerically by using a very fine grid. In both cases (v = 0.2 and v = 7), these results empirically 
demonstrate the significant gain obtained by using the proposed recycling schemes with a slight advantage 
to the DeMix-based approach. The average and the standard deviation of this Kolmogorov-Smirnov distance 
are divided by a factor of 2-3 compared to the case in which we use only the collection of particles from the 
last iteration of the SMC sampler. 
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No Recycling 

Naive 

Recycling 

ESS-based 

Recycling 

DeMix 

Recycling 

N = 50 

T =25 Iter. N = 100 
N = 200 

0.1276 (0.0460) 
0.0835 (0.0224) 
0.0615 (0.0182) 

0.0727 (0.0234) 
0.0488 (0.0164) 
0.0353 (0.0103) 

0.0458 (0.0121) 
0.0366 (0.0094) 
0.0254 (0.0055) 

0.0407 (0.0123) 
0.0315 (0.0089) 
0.0237 (0.0053) 

N = 50 

T =50 Iter. N = 100 
N = 200 

0.1274 (0.0424) 
0.0898 (0.0251) 
0.0627 (0.0188) 

0.0514 (0.0167) 
0.0379 (0.0117) 
0.0267 (0.0068) 

0.0357 (0.0096) 
0.0268 (0.0067) 
0.0201 (0.0045) 

0.0311 (0.0097) 
0.0230 (0.0055) 
0.0185 (0.0037) 

N = 50 

T =100 Iter. N = 100 
N = 200 

0.1186 (0.0352) 
0.0846 (0.0231) 
0.0599 (0.0188) 

0.0391 (0.0117) 
0.0288 (0.0079) 
0.0216 (0.0054) 

0.0315 (0.0066) 
0.0226 (0.0054) 
0.0177 (0.0033) 

0.0243 (0.0060) 
0.0187 (0.0038) 
0.0159 (0.0031) 


TABLE II: Comparison of recycling schemes for the accuracy to approximate the posterior distribution p{9 i|y) 
in terms of the Kolmogorov-Smirnov distance (mean and standard deviation in parentheses) for Model 2 with 


v = 0 . 2 . 


C. Penalized regression model with count data 

In this section, we illustrate how the proposed SMC sampler can be efficiently used in penalized regression 
models with particular focus on counting process observations. Sparse regression analysis initially studied 
in the context of penalized least squares or likelihood has gained increasing popularity since the seminal 
paper on the LASSO [32], Since this work, many approaches under both frequentist and Bayesian have been 
proposed to extend these sparsity inducing regression frameworks. 

In a frequentist setting the most common choice sparsity inducing penalty is the i \-regularization for the 
regression coefficients /3 £ MP and it is known as LASSO, with penalty term 7Xf=i \Pi\- Under a Bayesian 
modelling paradigm, in which the regression coefficients arc treated as a random vector, one may recover the 
LASSO estimates from the maximum a posteriori (MAP) point estimator of the coefficients via a choice of 
prior on the coefficients given by the multivariate Laplace distribution, p(/3) oc exp(— 7Xf=i |A|)- 

A limitation in this approach is the use of identical penalization on each regression coefficient. This 
can lead to unacceptable bias in the resulting estimates [33]. Indeed, the classical £\ -regularization can 
lead to an over-shrinkage of large regression coefficients even in the presence of many zeros. This has 
resulted in sparsity-inducing non-convex penalties that use different penalty coefficients on each regression 
coefficient, i.e. X?=i 7 * IA have been proposed, as have grouping regularization constraints, see adaptive 
and sequential estimation approaches in [34]—[37]. Alternative non-convex approaches include the bridge 
regression framework, i.e. 7 X?=i \Pi\ q w 'th q £ (0,2) which is obtained using the exponential power (EP) 
distribution: 

/<ft 7. 0 = n ^7777 exp (■ 

which leads to the ^-regularization problem [38], [39]. Compared to previous non-convex prior, the latter 
possesses the advantage of not introducing additional tuning variables that need to be selected. 
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More specifically, in this section, we address the challenging regression problem for which the proposed 
strategies can be used as an efficient solution in finding the relationship between continuous input variables 
and count data as response. The likelihood is given by Poisson distribution defined as: 


Vi ~ Vo(yi\m) with 


Hi = exp 


A) + £ Pj*{ (xij) 


3 = 1 


(59) 


and (•) corresponds to the basis function used in the regression. 

For the experiments, the true n y = 100 observations have been generated with regression coefficients set 
to zeros except f3g = 1, fa = 1.5, /?4 = —2, f3% = 1, f3y = —2 and (3g = 1.2. For the basis function, a 
Gaussian kernel defined as <Fjj(xj) = exp | ^ lx '^ Cj ^ 2 1 with 11 equally spaced centers Cj with the same 
scale parameters rj = r = 0.5 have been used. 

The dimension of the parameter vector 6 to estimate is 13: 12 coefficients in (3 and 7 ~ TQ(2, 1.3) which 
corresponds to the unknown scale parameter of the EP distribution defined in Eq. (l58l) . We have chosen 
iVv[CMc = 5 and B = 6 for the adaptive MWG (summarized in Algo. [2]) used in the SMC sampler as forward 
kernel. 

Fig. 0 illustrates that the resulting mean prediction curves (and associated confidence intervals) obtained 
by using the proposed SMC sampler. The true curve is always within the confidence region which clearly 
shows the ability of our algorithm to give a good prediction of the functional relationship between the input 
and output variables. One advantage of Bayesian approach (vs optimization technique like LASSO) is the 
ability to provide a posterior confidence region for the Bayesian estimates. 

Table UTT1 shows the variance of the SMC sampler estimate of the normalizing constant of the target posterior 
distribution, p(y), obtained by using the proposed adaptive cooling schedule and the linear cooling schedule. 
As in previous examples, the variance obtained by using the proposed adaptive cooling schedule is significantly 
lower than the one obtained with a linear cooling schedule. This normalizing constant is a quantity of great 
interest for example whenever the best suitable basis function and/or distribution of the response have to be 
selected from a dictionary of different possible choices. The ability of the proposed adaptive cooling strategy 
to provide an estimator with smaller variance is clearly an important benefit in such cases. 

We now investigate the performance of the SMC sampler to correctly estimate the unknown coefficients of 
regression and as a consequence give some accurate prediction of the functional relationship between the input 
and output variables. Table HV1 clearly demonstrates that our proposed scheme (ESS and DeMix) outperforms 
the two other schemes that were used in these simulations in terms of the stability of the posterior mean 
estimator. Again, the DeMix recycling scheme achieved a slightly better result than the ESS-based strategy 
in terms of variance. 
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Fig. 7: Regression with count data [Prior: EP q = 0.5] by using Demix recycling scheme: true function in 
blue - observed responses in green circles - posterior mean from SMC sampler in red and confidence region 
in gray 5% to 95% percentiles. 



Linear cooling schedule 

Proposed Adaptive 
cooling schedule 



N = 

50 

-221.0070 

± 

124.3479 

-203.3146 

± 

0.8215 

T = 

50 

N = 

200 

-211.8227 

± 

34.1198 

-203.2687 

± 

0.2325 



N = 

50 

-211.4072 

± 

28.2070 

-203.1100 

± 

0.4598 

T = 

100 

N = 

200 

-206.1121 

± 

10.0980 

-203.1244 

± 

0.0698 



N = 

50 

-206.3567 

± 

13.6623 

-202.9167 

± 

0.1627 

T = 

200 

N = 

200 

-204.1015 

± 

3.8083 

-203.0268 

± 

0.0530 


TABLE III: The estimation of the marginal likelihood log p(y\M\) (mean ± variance) in count data regression 
under model M.\ [Prior: EP with q = 0.5]. 



No Recycling 

Naive 

Recycling 

ESS-based 

Recycling 

DeMix 

Recycling 

N = 50 

T = 50 N = 200 

0.00318883 

0.00069063 

0.00318883 

0.00069063 

0.00216641 

0.00050476 

0.00193355 

0.00046064 

N = 50 

T = 100 N = 200 

0.00268185 

0.00079279 

0.00268197 

0.00079267 

0.00148822 

0.00048881 

0.00130908 

0.00040874 

N = 50 

T = 200 N = 200 

0.00308443 

0.00094644 

0.00307905 

0.00094534 

0.00170143 

0.00051125 

0.00149617 

0.00041064 


TABLE IV: Variance of approximated curve in count data regression [Prior: EP with q = 0.5]. 


VI. Conclusion 

In this paper, we discuss the use of SMC samplers for Bayesian inference. A simple form of the asymptotic 
variances for the SMC sampler estimator is derived under some assumptions. From this expression, a novel 
criterion to optimize is described in order to automatically and adaptively decide the cooling schedule of 
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the algorithm. The proposed strategy consists in finding the evolution of the temperature along the SMC 
iterations that will optimize the (asymptotic) variance of the estimator of the normalizing constant of the 
target distribution. Furthermore, we propose two different approaches (ESS and DeMix) that recycle all past 
simulated particles for the final approximation of the posterior distribution. Numerical simulations clearly show 
that significant improvement can be obtained by using these different propositions into the SMC samplers. 

Appendix 

Proof of Proposition |Tj 

In this appendix, we present the proof of Proposition Q] related to the asymptotic variance of the SMC 
sampler estimator when resampling is performed before the sampling step. In [9], the authors does not study 
this case since the resampling cannot always be done before the sampling. In particular, as discussed in 
Section ITl-B I we can do the resampling before the sampling when the weights does not depend on the current 
value of the particle as it is the case when the backward kernel is the one used in this proposition. 

A. On the estimation of an expectation 

This results is quite straightforward to obtain by using classical Monte-Carlo results since we use a perfectly 
mixing kernel, the particles are (asymptotically) drawn at the t-th iteration, for i = 1,..., N: 

O^-TT i(-) ( 60 ) 


which leads to the following particle estimate of the expectation: 

1 N 

= ( 6l ) 

2=1 

All particles are equally weighted since we have performed the resampling before the sampling step. As a 
consequence, we obtain: 

N 2 {E w? ((p) - E fft (<£>)} =► J\f(0, a 2 sMC ,t(v)) (62) 

with ^sMCti^P) th e variance of ip(0) with respect to 7q, i.e. VsMCtiP) = Var„- t (<£>(#)). 


B. On the estimation of the normalizing constant 

In this section, we will derive the asymptotic variance related to the estimator of the normalizing constant. 
Let us firstly study the estimate of the ratio of normalizing constant,Zt/Z t _i, defined in Eq. (fill) which is 
given in the context of the proposition n by: 

m— 1 ra=l Ti—1 \Pt— 1 / 


Zt 

Zt- 


t -1 


(63) 


April 23, 2015 


DRAFT 






30 


since the particles arc equally weighted due to the resampling before the sampling and the unnormalized 
incremental weights are defined in Eq. (l22l) when the backward kernel in Eq. (l2Tb is used. Moreover, owing 
to the perfect mixing assumption, we have: for i = 1 ,..., N: 


a {i) iid / ■> 
e t -1 ~ ’Tt-l(-) 


(64) 


From (1631) and (1641 ). the unbiasedness of this estimator is obvious: 




Zt— 


t-i 






7t_i(0 t _i) 

lt{6t-i) 7t-t(^-t) 
7t_i(0t_i) Z t - 1 


(65) 


dOt -1 = 


Z t - 


t-1 


Let us now study the variance of this estimator: 


Var 


Z t -1 


1 ,V 

— V Var 
iV ^ 


ra=l 


N 


E„ 


7t(#t-i) 

'yt-i(Ot-i) 

lM-i) ' 

LTt-i(^t-t). 


( 66 ) 


— E; 


7t(0t-i) 

7t_i(0 t _i)_ 


In this expression, the mean has already been derived in Eq. (1651) and the second moment can be written as 


E„ 






. 7t(6t—i) 


dOt -1 




^ 2 _l J TT t -l{dt-l) 

which give the following expression for the variance: 


de t _ 


t~ i 


(67) 


Var 


1 ( Zt 


Zt-1 I N U-i 


^-t) 

77—i(0t-7 


■dOt -i — 1 


( 68 ) 


In the results given in Proposition „ we want to have the variance of the log of the normalizing constant 
at time t which can be rewritten using Eq. (fl4l) as 




(69) 


From this expression, we have to obtain the variance of the log ratio of the normalizing constant. This term 
can be obtained by using the delta method [40] that states that if 


m (X n -n)^N(0,a 2 ) 


(70) 
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then for a given function g and a specific value of fi (by assuming that g'{p) exists and is not 0) 


N 2 (g(X n ) - g{fi)) =*■ A/"(0, [g'{ji)] 2 ) 


(71) 


By using this delta method and Eqs. (1681 ) and ( |69[ ). we finally obtain the result presented in Proposition [T] 


i.e.: 


with 



(72) 


(73) 
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